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) } ' Abstract. We review the problem of confounding in genetic associa- 

O ■ tion studies, which arises principally because of population structure 

| and cryptic relatedness. Many treatments of the problem consider only 

■ a simple "island" model of population structure. We take a broader 
CN . approach, which views population structure and cryptic relatedness as 

different aspects of a single confounder: the unobserved pedigree defin- 

["tI ■ ing the (often distant) relationships among the study subjects. Kinship 

^Tj \ is therefore a central concept, and we review methods of defining and es- 

^"2 ' timating kinship coefficients, both pedigree-based and marker-based. In 

C$ . this unified framework we review solutions to the problem of population 

\ structure, including family-based study designs, genomic control, struc- 
tured association, regression control, principal components adjustment 

, and linear mixed models. The last solution makes the most explicit use 

' of the kinships among the study subjects, and has an established role 

■ in the analysis of animal and plant breeding studies. Recent computa- 
\^ | tional developments mean that analyses of human genetic association 
"^j- ■ data are beginning to benefit from its powerful tests for association, 
Q . which protect against population structure and cryptic kinship, as well 
t-h ' as intermediate levels of confounding by the pedigree. 

O ! 
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^ . 1. CONFOUNDING IN GENETIC 

EPIDEMIOLOGY 
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an association indirectly through a genotyped locus 
that is nearby on the genome. In this review we are 
concerned with the task of guarding against spu- 
rious associations, those which do not arise at or 
near a causal locus. We first introduce background 
material describing linkage and association studies, 
population structure and linkage disequilibrium, the 
problem of confounding by population structure and 
cryptic relatedness. In Section 2 we discuss defini- 
tions and estimators of the kinship coefficients that 
are central to our review of methods of correcting for 
confounding by population structure and cryptic re- 
latedness, which is presented in Section 3. Finally, in 
Section 4 we present the results of a small simulation 
study illustrating the merits of the most important 
methods introduced in Section 3. 

Although association designs are used to study 
other species, we will mainly take a human-genetics 
viewpoint. For example, we will focus on binary phe- 
notypes, such as disease case/control or drug re- 
sponder/nonresponder, which remain the most com- 
monly studied type of outcome in humans, although 
quantitative (continuous), categorical and time-to- 
event traits are increasingly important. The subjects 
of an association study are sometimes sampled from 
a population without regard to phenotype, as in 
prospective cohort designs. However, retrospective 



ascertainment of individuals on the basis of pheno- 
type, as in case-control study designs, is more com- 
mon in human genetics, and we will focus on such 
designs here. 

Linkage studies (Thompson, 2007) form the other 
major class of study designs in genetic epidemiol- 
ogy. These seek loci at which there is correlation 
between the phenotype of interest and the pattern 
of transmission of DNA sequence over generations 
in a known pedigree. In contrast, association stud- 
ies are used to search for loci at which there is a 
significant association between the phenotypes and 
genotypes of unrelated individuals. These associa- 
tions arise because of correlations in transmissions 
of phenotypes and genotypes over many generations, 
but association analyses do not model these trans- 
missions directly, whereas linkage analyses do. The 
relatedness of study subjects is therefore central to a 
linkage study, whereas the relatedness of association 
study subjects is typically unknown and assumed to 
be distant; any close relatedness is a nuisance (Fig- 
ure 1). 

In the last decade, association studies have be- 
come increasingly prominent in human genetics, while, 
although they remain important, the role of link- 
age studies has declined. Linkage studies can pro- 
vide strong and robust evidence for genetic causa- 
tion, but are limited by the difficulty of ascertaining 
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Fig. 1. Schematic illustration of differences between linkage studies, which track transmissions in known pedigrees, and 
population association studies which assume "unrelated" individuals. Open circles denote study subjects for whom phenotype 
data are available and solid lines denote observed parent-child relationships. Dotted lines indicate unobserved lines of descent, 
which may extend over many generations, and filled circles indicate the common ancestors at which these lineages first diverge. 
Unobserved ancestral lineages also connect the founders of a linkage study, but these have little impact on inferences and 
are ignored, whereas they form the basis of the rationale for an association analysis and constitute an important potential 
confounder. 
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enough suitable families, and by insufficient recom- 
binations within these families to refine the loca- 
tion of a causal variant. When only a few hundred 
of genetic markers were available, lack of within- 
family recombinations was not a limitation. Now, 
cost-effective technology for genotyping ~10 6 single 
nucleotide polymorphism (SNP) markers distributed 
across the genome has made possible genome-wide 
association studies (GWAS) which investigate most 
of the common genetic variation in a population, 
and obtain orders of magnitude finer resolution than 
a comparable linkage study (Morris and Cardon, 
2007; Altshuler, Daly and Lander, 2008). GWAS are 
preferred for detecting common causal variants (say, 
population fraction > 0.05), which typically have 
only a weak effect on phenotype, whereas linkage 
studies remain superior for the detection of rare vari- 
ants of large effect (because these effects are more 
strongly concentrated within particular families). 

Because genes are essentially immutable during 
an individual's lifetime, and because of the inde- 
pendence of allelic transmissions at unlinked loci 
(Mendel's Second Law), linkage studies are virtually 
immune to confounding. Association studies are, how- 
ever, susceptible to genetic confounding, which is 
usually thought of as coming in two forms: popula- 
tion structure and cryptic relatedness. These are in 
fact two ends of a spectrum of the same confounder: 




Fig. 2. Schematic illustration of the confounding role of 
pedigree on ancestral lineages at individual loci. Two possi- 
ble single-locus lineages are shown (solid lines), each embed- 
ded in the pedigree of Figure 1 (right). Moving upwards from 
the study subjects (open circles), when two lineages meet at 
a common ancestor (filled circle), they either coalesce into a 
single lineage, or else they pass through different alleles of 
the common ancestor and do not coalesce. Dotted lines show 
pedigree relationships that do not contribute to the ancestry 
of the study subjects at this locus. Although lineages are ran- 
dom, they are constrained by the pedigree, features of which 
are therefore reflected in lineages across the genome. 



the unobserved pedigree specifying the (possibly dis- 
tant) relationships among the study subjects (Fig- 
ure 1, right). Association studies are also susceptible 
to confounding if genotyping error rates vary with 
phenotype (Clayton et al., 2005). This can resemble 
a form of population structure and is not discussed 
further here. 

We can briefly encapsulate the genetic confound- 
ing problem as follows. Association studies seek ge- 
nomic loci at which differences in the genotype dis- 
tributions between cases and controls indicate that 
their ancestries are systematically different at that 
locus. However, pedigree structure can generate a 
tendency for systematic ancestry differences between 
cases and controls at all loci not subject to strong 
selection. Figure 2 illustrates two possible ancestral 
lineages of the study subject alleles at a locus. Lin- 
eages are correlated because they are constrained to 
follow the underlying pedigree. For example, if the 
pedigree shows clustering of individuals into sub- 
populations, then ancestral lineages at neutral loci 
will tend to reflect this. The goal of correction for 
population structure is to allow for the confounding 
pedigree effects when assessing differences in ances- 
try between cases and controls at individual loci. In 
the following sections we seek to expand on this brief 
characterization. 

1.2 Population Structure 

Informally, a population has structure when there 
are large-scale systematic differences in ancestry, for 
example, varying levels of immigrant ancestry, or 
groups of individuals with more recent shared ances- 
tors than one would expect in a panmictic (random- 
mating) population. Shared ancestry corresponds to 
relatedness, or kinship, and so population struc- 
ture can be defined in terms of patterns of kinship 
among groups of individuals. Population structure 
is often closely aligned with geography, and in the 
absence of genetic information, stratification by ge- 
ographic region may be employed to try to iden- 
tify homogeneous subpopulations. However, this ap- 
proach does not account for recent migration or for 
nongeographic patterns of kinship based on social 
or religious groups. 

The simplest model of population structure as- 
sumes a partition of the population into "islands" 
(subpopulations). Mating occurs preferentially be- 
tween pairs of individuals from the same island, so 
that the island allele fractions tend to diverge to an 
extent that depends on the inter-island migration 
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rates. An enhancement of the island model to incor- 
porate admixture allows individual-specific propor- 
tions of ancestry arising from actual or hypothetical 
ancestral islands. 

Below we will focus on island models of population 
structure, because these are simple and parsimo- 
nious models that facilitate discussion of the main 
ideas. Moreover, several popular statistical methods 
for detecting population structure and correcting as- 
sociation analysis for its effects have been based 
entirely on such models. However, human popula- 
tion genetic and demographic studies suggest that 
island models typically do not provide a good fit for 
human genetic data. Colonization often occurs in 
waves and is influenced by geographic and cultural 
factors. Such processes are expected to lead to clinal 
patterns of genetic variation rather than a partition 
into subpopulations (Handley et al., 2007). Modern 
humans are known to have evolved in Africa with 
the first wave of human migration from Africa esti- 
mated to have been approximately 60,000 years ago. 
Reflecting this history, current human genetic diver- 
sity decreases roughly linearly with distance from 
East Africa (Liu et al., 2006). Within Europe, Lao 
et al. (2008) found that the first two principal com- 
ponents of genome- wide genetic variation accurately 
reflect latitude and longitude: there is population 
structure at a Europe-wide level, but no natural 
classification of Europeans into a small number of 
subpopulations. Similarly, there does not appear to 
be a simple admixture model based on hypothetical 
ancestral subpopulations that can adequately cap- 
ture European genetic variation, although a model 
based on varying levels of admixture from hypothet- 
ical "North Europe" and "South Europe" subpopu- 
lations could at least capture the latitude effect. The 
admixture model may be appropriate when the cur- 
rent population results from some intermixing fol- 
lowing large-scale migrations over large distances, 
such as in Brazil or the Caribbean. 

Because the term "population stratification" can 
imply an underlying island model, we avoid this 
term and adhere to "population structure," which 
allows for more complex underlying demographic 
models. 

1.3 Linkage Disequilibrium 

In a large, panmictic population, and in the ab- 
sence of selection, pairs of genetic loci that are not 
tightly linked (close together on a chromosome) are 
unassociated at the population level (McVean, 2007). 
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Fig. 3. Illustration of the role of linkage disequilibrium in 
generating phenotypic association with a noncausal genotyped 
marker due to a tightly-linked ungenotyped causal locus. 

Such linkage equilibrium arises because recombina- 
tion events ensure the independent assortment of al- 
leles when they are transmitted across generations 
(a process sometimes called Mendelian Randomiza- 
tion). Conversely, because recombination is rare (~1 
recombination per chromosome per generation), 
tightly linked loci are generally correlated, or in link- 
age disequilibrium (LD) in the population. This is 
because many individuals can inherit a linked al- 
lele pair from a remote common ancestor without 
an intervening recombination. Association mapping 
relies on LD because, even for a GWAS, only a small 
proportion of genetic variants are directly measured. 
Signals from ungenotyped causal variants can only 
be detected through phenotype association with a 
genotyped marker that is in sufficiently strong LD 
with the causal variant (Figure 3). LD is a double- 
edged sword: the stronger the LD around a causal 
variant, the easier it is to detect, because the greater 
the probability it is in high LD with at least one 
genotyped marker (Pritchard and Przeworski, 2001). 
However, in a region of high LD it is hard to fine- 
map a causal variant because there will be multi- 
ple highly-correlated markers each showing a similar 
strength of association with the phenotype. 

1.4 Spurious Associations due to 
Population Structure 

Unfortunately, population structure can cause LD 
between unlinked loci and consequently generate spu- 
rious marker-phenotype associations. For example, 
in the island model of population structure, if the 
proportion of cases among the sampled individuals 
varies across subpopulations, then alleles that vary 
in frequency across subpopulations will often show 
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association with phenotype. One or more such alleles 
may in fact be involved in phenotype determination, 
but standard association statistics may not distin- 
guish them from the many genome-wide alleles with 
frequencies that just happen to vary across subpop- 
ulations because of differential genetic drift or nat- 
ural selection. To express this another way, many 
alleles across the genome are likely to be somewhat 
informative about an individual's subpopulation of 
origin, and hence be predictive of any phenotype 
that varies across subpopulations. For example, in 
a large sample drawn from the population of Great 
Britain, many genetic variants are likely to show as- 
sociation with the phenotype "speaks Welsh." These 
will be alleles that are relatively common in Wales, 
which has a different population history from Eng- 
land (Weale et al., 2002), and do not "cause" speak- 
ing Welsh. 

Under an island model, one could potentially solve 
the problem of spurious associations by matching 
for ancestry, for example, by choosing for each case 
a control from the same subpopulation. However, as 
noted above, an island model is unlikely to describe 
the ancestry of a human population adequately. We 
each have a distinct pattern of ancestry, to a large 
extent unknown beyond a few generations, making 
precise matching impractical while crude matching 
may be insufficient. The spouse of a case, or another 
relative by marriage, can provide a genetically un- 
related control approximately matched for ancestry, 
but there are obvious limitations to this approach. 

There are at least three reasons why, in an un- 
matched study, the phenotypes of study subjects 
might vary systematically with ancestry (e.g., with 
subpopulation in an island model) . The most straight- 
forward reason is that the disease prevalence varies 
across subpopulations in accordance with the fre- 
quencies of causal alleles, and the differing sample 
casexontrol ratios across subpopulations reflect the 
differing subpopulation prevalences. Alternatively, 
subpopulation prevalences may vary because of dif- 
fering environmental risks. Third, ascertainment bias 
can make an important contribution to associations 
between ancestry and phenotype. Ascertainment bias 
can arise if there are differences in the sampling 
strategies between cases and controls that are corre- 
lated with ancestry. In the island model, this means 
that the sample casexontrol ratios across subpopu- 
lations do not reflect the subpopulation prevalences. 
This may happen, for example, because cases, but 
not controls, are sampled from clinics that over- 
represent particular groups. 



1.5 Extent of the Problem 

The vulnerability of association studies to con- 
founding by population structure has been recog- 
nized for many years. In a famous example, Knowler 
et al. (1988) found a significant association between 
an immunoglobulin haplotype and type II diabetes. 
The study subjects were native North Americans 
with some European ancestry and the association 
disappeared after stratification by ancestry. Many 
commentators fail to note that Knowler et al. un- 
derstood the problem and performed an appropriate 
analysis, so that no false association was reported: 
they merely noted the potential for confounding in 
an unstratified analysis. 

Marchini et al. (2004a) concluded from a simu- 
lation study that, even in populations with rela- 
tively modest levels of structure (such as Europe or 
East Asia), when the sample is large enough to pro- 
vide the required power, the most significant SNPs 
can have their p-values reduced by a factor of three 
because of population structure, thus exaggerating 
the significance of the association. Freedman et al. 
(2004) examined a study into prostate cancer in (ad- 
mixed) African Americans and estimated a similar 
reduction in the smallest p-values. Another study 
of European- Americans found a SNP in the lactase 
gene significantly associated with variation in height 
(Campbell et al., 2005). When the subjects were 
stratified according to North/ West or South/East 
European ancestry, the association disappeared. Since 
we expect connections among lactase tolerance, diet 
and height, the association could be genuine and in- 
volve different diets, but the confounding with popu- 
lation structure makes this difficult to establish. Hel- 
gason et al. (2005) used pedigree and marker data 
from the Icelandic population, and found evidence 
of population structure in rural areas, which would 
result on average in a 50% increase in the magnitude 
of a xf association statistic. 

Following Pritchard and Rosenberg (1999) and Gor- 
roochurn et al. (2004), Rosenberg and Nordborg (2006) 
considered a general model for populations with con- 
tinuous and discrete structure and presented neces- 
sary and sufficient conditions for spurious associa- 
tion to occur at a given locus. They defined a param- 
eter measuring the severity of confounding under 
general ascertainment schemes, and showed that, 
broadly speaking, the case of two discrete subpop- 
ulations is worse than the cases of either more sub- 
populations or an admixed population. As the num- 
ber of subpopulations becomes larger, the problem 
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of spurious association tends to diminish because the 
law of large numbers smoothes out correlation be- 
tween disease risk and allele frequencies across sub- 
populations (Wang, Localio and Rebbeck, 2004). 

In recent years results have been published from 
hundreds of GWAS into complex genetic traits 
(NHGRI GWAS Catalog, 2009). McCarthy et al. 
(2008) described the current consensus. The impact 
of population structure on association studies should 
be modest "as long as cases and controls are well 
matched for broad ethnic background, and measures 
are taken to identify and exclude individuals whose 
GWAS data reveal substantial differences in genetic 
background." This is consistent with a report from 
a study of type II diabetes in UK Caucasians which 
estimated that population structure was responsi- 
ble for only ~4% inflation in x% association statis- 
tics (Clayton et al., 2005). The Wellcome Trust Case 
Control Consortium (2007) study of seven common 
diseases using a UK population sample found fewer 
than 20 loci exhibiting strong geographic variation. 
The genome- wide distribution of test statistics sug- 
gested that any confounding effect was modest and 
no adjustment for population structure was made 
for the majority of their analyses. 

In conclusion, the magnitude of the effect of struc- 
ture depends on the population sampled and the 
sampling scheme, and well-designed studies should 
usually suffer only a small impact. However, most 
of the associated variants so far identified by GWAS 
have been of small effect size (NHGRI GWAS Cat- 
alog, 2009), and as study sizes increase in order to 
detect smaller effects, even modest structure could 
substantially increase the risk of false positive asso- 
ciations. 

1.6 Cryptic Relatedness 

Cryptic relatedness refers to the presence of close 
relatives in a sample of ostensibly unrelated indi- 
viduals. Whereas population structure generally de- 
scribes remote common ancestry of large groups of 
individuals, cryptic relatedness refers to recent com- 
mon ancestry among smaller groups (often just pairs) 
of individuals. Like population structure, cryptic re- 
latedness often arises in unmatched association stud- 
ies and can have a confounding effect on inferences. 
Indeed, Devlin and Roeder (1999) argued that cryp- 
tic relatedness could pose a more serious confound- 
ing problem than population structure. A subse- 
quent theoretical investigation of plausible demo- 
graphic and sampling scenarios (Voight and 



Pritchard, 2005) showed that the effect of cryptic 
relatedness in well-designed studies of outbred pop- 
ulations should be negligible, but it can be notice- 
able for small and isolated populations. Using pedi- 
gree and empirical genotype data from the Hutterite 
population, these authors found that cryptic relat- 
edness reduces an association p-value of 10~ 3 by a 
factor of approximately 4, and that the smaller the 
p-value the greater is the relative effect. 

2. GENETIC RELATIONSHIPS 

2.1 Kinship Coefficients Based on 
Known Pedigrees 

The relatedness between two diploid individuals 
can be defined in terms of the probabilities that 
each subset of their four alleles at an arbitrary lo- 
cus is identical by descent (IBD), which means that 
they descended from a common ancestral allele with- 
out an intermediate mutation. The probability that 
the two homologous alleles within an individual i 
are IBD is known as its inbreeding coefficient, fa. 
When no genotype data are available, IBD prob- 
abilities can be evaluated from the distribution of 
path lengths when tracing allelic lineages back to 
common ancestors (Figure 2), convolved with a mu- 
tation model (Malecot, 1969). More commonly, IBD 
is equated with "recent" common ancestry, where 
"recent" may be defined in terms of a specified, 
observed pedigree, whose founders are assumed to 
be completely unrelated. In theoretical models, "re- 
cent" may be defined, for example, in terms of a 
specified number of generations, or since the last 
migration event affecting a lineage. Linkage analy- 
sis conditions on the available pedigree, and in this 
case the definition of IBD in terms of shared an- 
cestry within that pedigree, and the assumption of 
unrelated founders, cause no difficulty. However, the 
strong dependence on the observed pedigree, or other 
definition of "recent" shared ancestry, is clearly un- 
satisfactory for a more general definition of related- 
ness. 

A full description of the relatedness between two 
diploid individuals requires 15 IBD probabilities, one 
for each nonempty subset of four alleles, but if we re- 
gard the pair of alleles within each individual as un- 
ordered, then just eight identity coefficients (Jacquard, 
1970) are required (Figure 4). An assumption of no 
within-individual IBD (no inbreeding) allows these 
eight coefficients to be collapsed into two (Cotter- 
man, 1940), specifying probabilities for the two in- 
dividuals to share exactly one and two alleles IBD. 
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Fig. 4. Schematic illustration of the nine relatedness classes 
for two individuals, whose four alleles are indicated by filled 
circles, that are specified by the eight Jacquard identity-by- de- 
scent (IBD) coefficients. Within-individual allele pairs are re- 
garded as unordered, and solid lines link alleles that are IBD. 



Both these coefficients are required for models in- 
volving dominance, but for additive genetic models 
they can be reduced to a single kinship coefficient, 
Kij, which is the probability that two alleles, one 
drawn at random from each of i and j, are IBD. 
Similarly, Ka is the probability that two alleles, 
sampled with replacement from i, are IBD. Thus, 
Ka = (1 + fi)/2, and, in particular, the kinship of 
an outbred individual with itself is 1/2. 

The kinship matrix K of a set of individuals in a 
pedigree can be computed by a recursive algorithm 
that neglects within-pedigree mutation (Thompson, 
1985). K is positive semi-definite if the submatrix 
of assumed founder kinships is positive semi-definite 
(which is satisfied if, as is typical, founders are as- 
sumed unrelated). 

2.2 Kinship Coefficients Based on Marker Data 

The advent of GWAS data means that genome- 
average relatedness can now be estimated accurately. 
It can be preferable to use these estimates in associ- 
ation analyses even if (unusually) pedigree-based es- 
timates are available. There is a subtle difference be- 
tween expectations computed from even a full pedi- 
gree, and realized amounts of shared genomic ma- 
terial. For example, if two lineages from distinct in- 
dividuals meet in a common ancestor many genera- 
tions in the past, then this ancestor will contribute 
(slightly) to the pedigree-based relatedness of the 
individuals but may or may not have passed any ge- 
netic material to both of them. Similarly, two pairs 



of siblings in an outbred pedigree may have the same 
pedigree relatedness, but (slightly) different empiri- 
cal relatedness (Weir, Anderson and Hepler, 2006). 

Thompson (1975) proposed maximum likelihood 
estimates (MLEs) of the Cotterman coefficients, while 
Milligan (2003) made a detailed study of MLEs un- 
der the Jacquard model. These MLEs can be prone 
to bias when the number of markers is small and 
can be computationally intensive to obtain partic- 
ularly from genome- wide data sets (Ritland, 1996; 
Milligan, 2003). 

Method of moments estimators (MMEs) are typi- 
cally less precise than MLEs, but are computation- 
ally efficient and can be unbiased if the ancestral 
allele fractions are known (Milligan, 2003). Under 
many population genetics models, if two alleles are 
not IBD, then they are regarded as random draws 
from some mutation operator or allele pool (Rous- 
set, 2002), which corresponds to the notion of "un- 
related." The kinship coefficient is then a cor- 
relation coefficient for variables indicating whether 
alleles drawn from each of i and j are some given 
allelic type, say, A. If Xi and Xj count the numbers 
of A alleles (0, 1 or 2) of i and j, then 



(2.1) 



Cov(xj, xj) = 4p(l — p)K, 



u - 



where p is the population fraction of A alleles. Thus, 
can be estimated from genome- wide covariances 
of allele counts. Specifically, if we write col- 
umn vector over individuals and let the subscript 
index the L loci (rather than individuals), then 



(2.2) 



K 



1 L 



(x l -2p l l)(x l -2p l l) 



i=i 



4pi(l-pi) 



is an unbiased and positive semi-definite estimator 
for the kinship matrix K. Entries in K can also be 
interpreted in terms of excess allele sharing beyond 
that expected for unrelated individuals, given the 
allele fractions. According to Ritland (1996), who 
considered similar estimators and gave a generaliza- 
tion to loci with more than two alleles, (2.2) was 
first given in Li and Horvitz (1953) but only for in- 
breeding coefficients. 

In practice, we do not know the allele fractions 
p\. The natural estimators assume outbred and un- 
related individuals, deviation from which can exag- 
gerate the downward bias in the Ky estimates that 
arises from the overfitting effect of estimating the 
Pi from the same data. To reduce the first problem, 
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one could iteratively re-estimate the pi after making 
an initial estimate of K with 

A \ T k~ 1 x l 

Although the correlations arising from shared an- 
cestry are in principle positive, because of bias aris- 
ing from estimation of the pi , off-diagonal entries of 
(2.2) can be negative, a property that has caused 
some authors to shun such estimators of K (Milli- 
gan, 2003; Yu et al., 2006; Zhao et al., 2007). Rous- 
set (2002) also criticized the model underlying (2.1) 
in the context of certain population genetics mod- 
els, but did not propose an alternative estimator 
of genetic covariance in actual populations. For our 
purpose, that of modeling phenotypic correlations, 
genotypic correlations seem intuitively appropriate 
and the interpretation of Kij probability seems 
unimportant. Under the interpretation of as ex- 
cess allele sharing, negative values correspond to in- 
dividuals sharing fewer alleles than expected given 
the allele frequencies. 

Table 1 shows the probability that alleles chosen 
at random from each of two individuals match, that 
is, are identical by state (IBS), at a genotyped dial- 
lelic locus. The genome-wide average IBS probabil- 
ity can be expressed as 

1 L 1 
(2-3) ^( a .,-i)( Xl _if + 

1=1 

If the mutation rate is low, IBS usually arises as a 
result of IBD, and (2.3) can be regarded as an MME 
of the pedigree-based kinship coefficient in the limit- 
ing case that IBS implies IBD. This estimator over- 
comes the problem with pedigree-based estimators 
of dependence on the available pedigree, but it is 
sensitive to recurrent mutations. 

Software for computing average allele sharing (IBS) 
is included in popular packages for GWAS analy- 
sis such as PLINK (Purcell et al., 2007). However, 
because the excess allele-sharing (genotypic corre- 
lation) estimator of kinship coefficients (2.2) incor- 
porates weighting by allele frequency, it is typically 
more precise than (2.3). Sharing a rare allele sug- 
gests closer kinship than sharing a common allele, 
because the rare allele is likely to have arisen from 
a more recent mutation event (Slatkin, 2002). To il- 
lustrate the increased precision of (2.2) over (2.3), 
we simulated 500 genetic data sets comprising 200 
idealized cousin pairs (no mutation, and the alleles 



not IBD from the common grandparents were inde- 
pendent draws from an allele pool) and 800 unre- 
lated individuals, all genotyped at 10,000 unlinked 
SNPs. After rescaling to ensure the two estimators 
give the same difference between the mean kinship 
estimate of cousin pairs and mean kinship estimate 
of unrelated pairs, the resulting standard deviations 
(Table 2) are about 40% larger for the total allele 
sharing (IBS) estimator (2.3) than for the excess 
allele-sharing (genetic correlation) estimator (2.2). 

The marker-based estimates of kinship coefficients 
discussed above do not take account of LD between 
markers, nor do they exploit the information about 
kinship inherent from the lengths of genomic re- 
gions shared between two individuals from a recent 
common ancestor (Browning, 2008). Hidden Markov 
models provide one approach to account for LD 
(Boehnke and Cox, 1997; Epstein, Duren and 
Boehnke, 2000). In outbred populations, the IBD 
status along a pair of chromosomes, one taken from 
each of a pair of individuals in a sibling, half sib 
or parent-child relationship, is a Markov process. 
However, the Markovian assumption fails for more 
general relationships in outbred populations. When 
relationships are more distant, regions of IBD will 
tend to cluster. For example, in the case of first 
cousins IBD regions will cluster into larger regions 
that correspond to inheritance from one of the two 
shared grandparents. McPeek and Sun (2000) showed 
how to augment the Markov model to describe the 
IBD process when the chromosomes correspond to 
an avuncular or first cousin pair. Despite the inva- 
lidity of the Markov assumption, Leutenegger et al. 
(2003) found that in practice it can lead to reason- 
able estimates for relationships more distant than 
first-degree. 

3. CORRECTING ASSOCIATION ANALYSIS 
FOR CONFOUNDING 

In this review, we seek to use kinship to illuminate 
connections among popular methods for protecting 

Table 2 

Estimated standard deviations of two kinship coefficient 
MMEs, after linear standardization to put the estimates on 
comparable scales 

Estimator Unrelated pair Cousin pair 

Genetic correlation (2.2) 5.0 5.3 

IBS (2.3) 7.3 7.2 
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Table 1 

Identity-by state (IBS) coefficients at a single diallelic locus, defined as the probability that alleles drawn at random from i 
and j match, which gives 0.5 in the case of a pair of hetero zygotes. Another definition, based on the number of alleles in 

common between i and j, gives 1 for a pair of heterozygotes 
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association analyses from confounding. Many of these 
methods can be formulated within standard regres- 
sion models that express the expected value of yi, 
the phenotype of the ith individual, as a function of 
its genotype Xi at the SNP of interest: 

(3.1) g(E[y l ]) = a + x i p, 

where, for simplicity, we have not included covari- 
ates. Here g is a link function and f3 is a scalar or col- 
umn vector of genetic effect parameters at the SNP. 
Often Xi counts the number of copies of a specified 
allele carried by i, or it can be a two-dimensional 
row vector that implies a general genetic model. 

For a case-control study, g is typically the logit 
function and /3 are log odds ratios. This is a prospec- 
tive model, treating case-control status as the out- 
come, but inferences about /3 are typically the same 
as for the retrospective model, which is more ap- 
propriate for case-control data (Prentice and Pyke, 
1979; Seaman and Richardson, 2004). However, in 
some settings ascertainment effects are not correctly 
modeled prospectively, and it is necessary to con- 
sider retrospective models of the type 

(3.2) g(E[xi]) = a + yi p, 

where g is typically the identity function. 

3.1 Family-Based Tests of Linkage and 
Association (FBTLA) 

The archetypal FBTLA is the transmission dis- 
equilibrium test (TDT) (Spielman, McGinnis and 
Ewens, 1993) for systematic differences between the 
genotypes of affected children and those expected 
under Mendelian randomization of the alleles of their 
unaffected parents. If an allele is directly 
risk-enhancing, it will be over-transmitted to cases. 
If not directly causal but in LD with a causal allele, 
it may also be over-transmitted, but in this case it 
must also be linked with the causal variant, since 
otherwise Mendelian randomization will eliminate 
the association between causal and tested alleles. 



Thus, the TDT is a test for both association and 
linkage. The linkage requirement means that the test 
is robust to population structure, while the associa- 
tion requirement allows for fine-scale localization. 

Parents that are homozygous at the tested SNP 
are uninformative and not used. Transmissions from 
heterozygote parents are assumed to be indepen- 
dent, which implies a multiplicative disease model. 
Let n a and denote respectively the number of a 
and A alleles transmitted to children by Aa heterozy- 
gote parents. If there is no linkage, each parental al- 
lele is equally likely to be transmitted, so that the 
null hypothesis for the TDT is 

Ho:E[n a ]=E[n A ]. 

Conditional on the number of heterozygote parents 
n a + n A) the test statistic n a has a Binomial (n a + n A , 
1/2) null distribution, but McNemar's statistic 

(3.3) (".-"a) 2 
n 3 + ra A 

which has an approximate xf nun distribution 
(Agresti, 2002), is widely used instead. The TDT 
can be derived from the score test of a logistic re- 
gression model in which transmission is the out- 
come variable, and the parental genotypes are pre- 
dictors (Dudbridge, 2007). In Section 3.3 we outline 
a test which can exploit between-family as well as 
within-family information when it is available, while 
retaining protection from population structure. Ti- 
wari et al. (2008) survey variations of the TDT in 
the context of a review of methods of correction for 
population structure. 

The main disadvantages of the TDT and other 
FBTLA are the problem of obtaining enough fami- 
lies for a well powered study (particularly for adult- 
onset diseases) and the additional cost of genotyp- 
ing: three individuals must be genotyped to obtain 
the equivalent of one case-control pair, and homozy- 
gous parents are uninformative. Given the availabil- 
ity of good analysis-based solutions to the prob- 
lem of population structure (see below), the design- 
based solution of the FBTLA pays too high a price 
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for protection against spurious associations (Car- 
don and Palmer, 2003). However, FBTLA designs 
(like other linkage designs) can also be used to in- 
vestigate parent-of-origin effects (Weinberg, 1999), 
which is not usually possible for population-based 
case-control studies. 

3.2 Genomic Control 

Genomic Control (GC) is an easy-to-apply and 
computationally fast method for reducing the infla- 
tion of test statistics caused by population structure 
or cryptic relatedness. It can be applied to data of 
any family structure or none. GC was developed 
(Devlin and Roeder, 1999) for the Armitage test 
statistic, which is asymptotically equivalent to a score 
statistic under logistic regression (Agresti, 2002) and, 
in the absence of confounding, has an asymptotic Xi 
null distribution. The Armitage test assumes an ad- 
ditive disease model, but GC has also been adapted 
for tests of other disease models (Zheng et al., 2005; 
Zheng, Freidlin and Gastwirth, 2006). 

Figure 5(A) illustrates the inflation of Armitage 
test statistics at 2000 null SNPs simulated under 
an island model with admixture and ascertainment 
bias. This inflation could reflect many genome- wide 
true associations, but it is more plausible (and cor- 
rect for this simulation) that the inflation is due to a 
combination of population structure and ascertain- 
ment bias. The figure suggests that the inflation of 
test statistics is approximately linear, and Devlin 
and Roeder argued that this holds more generally. 
They therefore proposed to calibrate the type I er- 
ror of the Armitage test by adjusting all test statis- 
tics by a constant factor A. This leaves the ranking 
of markers in terms of significance unchanged [Fig- 
ure 5(B)], and so GC is equivalent to adjusting the 
significance threshold. 

For most complex phenotypes, only a few genome- 
wide SNPs correspond to strong causal associations, 
with test statistics in the upper tail of the empirical 
distribution. Consequently, the bulk of the empirical 
distribution, away from the upper tail, should reflect 
the null distribution and can be used to estimate A. 
Bacanu, Devlin and Roeder (2000) suggested esti- 
mating A as the ratio of the empirical median to its 
null value (=0.455), because the median is robust to 
a few large values in the upper tail. For the simula- 
tion of Figure 5, the median of the test statistics is 
0.59, leading to A = 1.31, a large value reflecting the 
strong ascertainment bias. 



Setakis, Stirnadel and Balding (2006) pointed out 
that ascertainment bias can cause median-adjusted 
GC to be very conservative. Marchini et al. (2004a) 
had previously noticed that for strong population 
structure GC can be anti-conservative when the num- 
ber of test statistics used to estimate A is <100, and 
conservative when the number is 3> 100. Devlin, Ba- 
canu and Roeder (2004) ascribed this problem to 
failure to account for the uncertainty in the estimate 
of A, but Marchini et al. (2004b) noted that this 
may not be the most important cause of the problem 
(see also below). To allow for this uncertainty, De- 
vlin, Bacanu and Roeder (2004) suggested using the 
mean of the test statistics to estimate A, since the 
mean-adjusted test statistics have an Fi i7n null dis- 
tribution. In the absence of true associations, Dadd, 
Weale and Lewis (2009) found mean-adjusted GC to 
be slightly superior to median adjustment. However, 
the median is more robust to true positives than the 
mean. As a compromise, Clayton et al. (2005) pro- 
posed adjusting on a trimmed mean, discarding say 
the highest 5% or 10% of test statistics. 

Lemma 1. The mean of the smallest 100q% val- 
ues in a large random sample of \\ statistics has 
expected value 

^d 3 (d7 l (q)), 

where d^ is the distribution function of a \\ random 
variable. 

Proof. Let X ~ xf, then 

fdiHq) l e ~x/2 

E(X\X <d7 1 (q))= / x — _ dx 

Jo qV2iT V x 

p^Ho) i 
= / — -=\/xe x/ dx 
Jo q\/2ir 

= ±d 3 (d7 1 (q)). n 

A limitation of all GC methods is that they do 
not distinguish markers at which the pattern of as- 
sociation is correlated with the underlying pedigree 
from those at which the pedigree does not contribute 
to the association and so for which no adjustment 
should be necessary. Figure 5(B) and (C) shows that 
median-GC-adjustment performs similarly to PC- 
adjustment (see Section 3.6 below) in countering the 
overall inflation of test statistics, but the corrected 
statistics can be very different [Figure 5(D)] because 
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Fig. 5. Q-Q plots for likelihood ratio tests of association in logistic regression (equivalent to the Armitage trend test), at 
2000 null SNPs simulated under a three-island model with Fst = 1%- From Island 1 there are 200 controls and 100 cases. 
Each of the remaining 700 individuals is admixed, the ith individual having a proportion ai of their ancestry from Island 2, 
the remainder from Island 3, where the ai are independent and Uniform(0, 1) . The ith admixed individual has a probability 
0.3 + 0.5 x Oi of being a case, so that case status is positively correlated with Island 2 ancestry. (A) expected versus observed 
quantiles, unadjusted; (B) expected versus observed after GC median- adjustment; (C) expected versus observed when the first 
two principal components are included as covariates; (D) GC-adjusted versus PC-adjusted quantiles. 



PC-adjustment is SNP-specific. GC often shows re- 
duced power to detect association compared to rival 
methods for adjusting for population structure. 

In the remainder of this section we show connec- 
tions between A and the kinship of study subjects. 
The Armitage test statistic can be written as T 2 /V , 
where T is the difference between the allele fractions 
in the samples of n\ cases and no controls, 



no 



%ii 



V is an estimate of the variance of T, 

\hq rii J \n \_n J 



and n = hq + n%. In the following we assume ret- 
rospective ascertainment, so that the case/control 
status y is fixed by the study design, while the allele 
count Xi is random. Devlin and Roeder (1999) no- 
ticed that E[T] = 0, irrespective of population struc- 
ture, but that Var[T] can be inflated relative to V. 
In general, 



(3.4) 



i,3 



ViVj . ( 1 -?/i)( 1 -yi) 
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• Cov(xi,Xj), 
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and substituting (2.1) into (3.4) leads to 
V a r[T] = Ap{1 - p \ D + R), 



where 



D 



n m 



V — yi + — (i 



K, 



. . I no ni , r , 
> mm — , — Ir I _K I , 
\m no' 



R 



(3.5) 



>, —ViVj + — 1 

-^(Vi-VjfKij. 

i+0 



yi){ l -yj) )Kij 



It also follows that 

4p(l -p) 



(3.6) E[y] 



n n x 



K,, 



hi 



which reduces to 2p(l — p)n/non\ if all study sub- 
jects are outbred and unrelated. Thus, provided that 



(3.7) 
we have 
A = E 



V 



1 



as n - 



oo, 



±1 



E[V] 
„ Var[T] 



D + R 



Since K is positive semi-definite, the second sum- 
mation in (3.6) is > 0, so that 



A> 
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R ( n n x 

— -j— f >mm — ,— 
Tr A \m n 



+ 



i? 



Tr[A]" 



The dominant quantity bounding A is R/Tr[K] and 
since TrfiiT] oc n, 



R 



Tr[K] 



R/n. 



The large n behavior of A depends on that of R. 
From (3.5) we see that increasing levels of kinship 
either among cases or among controls will tend to 
increase R, while greater case-control kinship tends 
to reduce R. In the worst-case scenario, a typical 
individual will be related at each degree of kinship 
to a fixed proportion of the study sample as n varies 
so that, unless the average kinship among cases and 
among controls is balanced by the average 



case-control kinship (such as when cases and con- 
trols are matched within subpopulations under an 
island model) , R oc n 2 and 



A- 



which generalizes the result of Devlin and Roeder 
(1999). In practice, it is unclear how A should scale 
with n in well designed GWAS studies of homoge- 
neous populations. 

The statistic T 2 /XV has the correct median or 
mean, but it will not have an asymptotic Xi nu h dis- 
tribution unless (3.7) holds. This condition does not 
hold, for example, in an island model with a fixed 
number of islands. This fact may underlie the poor 
performance of GC in the settings discussed above. 
See Zheng et al. (2010) for a detailed discussion of 
variance distortion in GC. 

3.3 Explicit Modeling of Genetic Correlations 

GC is designed to correct the null distribution of 
a test statistic derived under a probability model 
which is invalid in the presence of structure. An al- 
ternative strategy is to derive a statistic from a prob- 
ability model that better reflects the actual data 
generating process. For a retrospective study, the 
individual allele counts should be modeled as ran- 
dom variables satisfying (2.1). If we are prepared to 
assume the higher order moments of x are small, 
this can be achieved with a regression model of the 
form (3.2) with linear link and residuals 

x- la-yl3~N(0,a 2 K). 

If we assign to a a diffuse Gaussian prior N(0, t -2 ), 
with r 4 0, and to a 2 the improper (Jeffreys) prior 
with density proportional to cr -2 , we can derive the 
score statistic T 2 /V, with a \\ asymptotic distribu- 
tion, where 



and 



for 



(3.8) 



(n-l)V 



y T Px 



rp rp 

y Py ■ x Px 



(fPxf 



P = K~ 



l T K~ l l ' 

If the subjects are unrelated (2K = I), then T 
reduces to a comparison between the mean allele 
counts in cases and controls, as in the Armitage test, 
but the variance V is slightly smaller due to the final 
term. When the relatedness between study subjects 
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is unknown, as in a typical GWAS, the estimate K 
of (2.2) may be substituted for K. A similar ap- 
proach, but with a different form for V, has recently 
been proposed by Rakovski and Stram (2009), who 
point out that when the kinships are known T 2 /V 
is equivalent to the QLS statistic of Bourgain et al. 
(2003). 

The test described here could be used to analyze 
family data, either using K from the known pedigree 
or K estimated from genotype data. For example, 
for pedigree-based K in trios of two unrelated and 
unaffected parents and an affected child, T 2 matches 
(up to a constant) the numerator of (3.3). V differs 
slightly from the denominator of (3.3), reflecting the 
fact that the TDT conditions on the parental geno- 
types, whereas the test described here treats them as 
random. If the kinships are estimated from genome- 
wide marker data, this test can exploit the between- 
family as well as the within-family information, thus 
potentially increasing power over FBTLA, while re- 
taining protection from population structure. More- 
over, this is a very general approach, which applies 
for any ascertainment scheme and degree of related- 
ness or population structure among study subjects. 
Thus, if a researcher were unaware of the TDT, but 
applied the retrospective regression model to family 
trio data, s/he would automatically "invent" a test 
similar to the TDT but with potentially superior 
properties. 

3.4 Structured Association 

Structured association (SA) methods are based on 
the island model of population structure, and as- 
sume that the ancestry of each individual is drawn 
from one or more of the "islands." Popular soft- 
ware packages include ADMIXMAP (Hoggart et al., 
2003) and STRUCTURE/STRAT (Pritchard and 
Donnelly, 2001; Falush, Stephens and Pritchard, 
2003). These approaches model variation in ances- 
tral subpopulation along a chromosome as a Markov 
process. Stratified tests for association (Clayton, 
2007), such as the Mantel-Haenszel test, can then be 
performed to combine signals of association across 
subpopulations. More generally, a logistic regression 
model of the form (3.1) can be employed, with ad- 
mixture proportions (one for each subpopulation) 
entering as covariates. 

Similar to GC, SA methods can be effective us- 
ing only ~10 2 SNPs, but unlike GC, they can be 
computationally intensive, although a simplified and 



fast version of SA is implemented in PLINK (Pur- 
cell et al., 2007). The number of subpopulations can 
be estimated from the data by optimizing a mea- 
sure of model goodness of fit, but this increases the 
computational burden and there is usually no satis- 
factory estimate because, as we noted above in Sec- 
tion 1.2, the island model is not well suited to most 
human populations. Indeed, ADMIXMAP was pri- 
marily designed for admixture mapping, in which 
the genomes of admixed individuals are scanned for 
loci at which cases show an excess of ancestry from 
one of the founder populations (McKeigue, 2007). 
Because of the limited number of generations since 
the admixture event, this approach has features in 
common with linkage as well as association study 
designs. 

3.5 Regression Control 

Wang, Localio and Rebbeck (2005) showed that it 
is possible to control for population structure within 
a logistic regression model of the form (3.1) by in- 
cluding among the covariates the genotype at a sin- 
gle marker that is informative about ancestry. Se- 
takis, Stirnadel and Balding (2006) proposed using 
a set of ~10 2 widely-spaced SNPs, which are as- 
sumed to be noncausal (in practice, a "random" set 
of SNPs). These null SNPs are informative about the 
underlying pedigree, which we have argued forms 
the basis of the problem of inflation of test statistics 
due to population structure. Including these SNPs 
as regression covariates while testing a SNP of in- 
terest should eliminate most or all of the pedigree 
(population structure) effect. 

Setakis, Stirnadel and Balding (2006) suggest two 
standard procedures to avoid overfitting the SNP 
covariates: a backward (stepwise) selection and a 
shrinkage penalty approach. In the absence of ascer- 
tainment bias, both methods performed similarly to 
GC and SA, while being computationally fast and 
allowing the flexibility of the regression framework. 
With ascertainment bias, the regression control ap- 
proach substantially outperformed GC. 

Another approach (Epstein, Allen and Satten, 2007), 
which is also related to propensity score methods, 
uses ancestry-informative SNPs to create a risk score, 
stratifies study subjects according to this score, and 
performs a stratified test of association. 

3.6 Principal Component Adjustment 

Zhang, Zhu and Zhao (2003) proposed controlling 
for population structure in quantitative trait asso- 
ciation analysis by including principal components 
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(PCs) of genome-wide SNP genotypes as regression 
covariates. Price et al. (2006) presented a similar 
method, focusing on its application to case-control 
GWAS. PC regression is similar to the regression 
control approach of Setakis, Stirnadel and Balding 
(2006), but minimizes overfitting by using only a 
few linear combinations of SNPs (the PCs), rather 
than a larger number of individual SNPs. However, 
many more SNPs (typically ~10 ) are required in 
the PC-based approach. 

Let X denote a matrix with n rows correspond- 
ing to individuals and L columns corresponding to 
SNPs. Genotypes are initially coded as allele counts 
(0, 1 or 2) but are then standardized to have zero 
mean and unit variance. Then the n x n matrix 
XX T /L is the estimated kinship matrix K intro- 
duced at (2.2). Since K is symmetric and positive 
semi-definite, it has an eigenvalue decomposition 

(3.9) K=\xX T = vAv T , 

where the columns of v are the eigenvectors, or PCs, 
of K, while A is a diagonal matrix of corresponding 
(nonnegative) eigenvalues in decreasing order. 

Standard principal components analysis uses the 
L x L matrix X T X specifying the empirical cor- 
relations between the columns of the design ma- 
trix. Here, the variables of interest are the individ- 
uals, corresponding to rows, and, hence, we focus 
on K = XX T /L. However, because X is column- 
standardized, and not row-standardized, K is not 
an empirical correlation matrix. In particular, the 
diagonal entries of X T X are all one, whereas the di- 
agonal entries of K vary over individuals according 
to their estimated inbreeding coefficient (Section 2). 

To maximize the empirical variance of X T v\, the 
first PC v\ will typically be correlated with many 
SNPs. For example, in a two-island model, it will 
have greatest correlation with the SNPs whose al- 
lele fractions are most discrepant between the two 
islands. Thus, v\ acts as a strong predictor of island 
membership, and can also identify admixed individ- 
uals (intermediate scores). More generally, in an S- 
island model the first 5 — 1 PCs predict island mem- 
berships and individual admixture proportions (Pat- 
terson, Price and Reich, 2006) (see Figure 6 for an 
illustration with S = 3). The subsequent PCs repre- 
sent the within- island pedigree effects. Cryptic kin- 
ship typically generates weaker LD than large-scale 
population structure, and the effects of small groups 
of close (even first-degree) relatives are usually not 
reflected in the leading PCs. 
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Fig. 6. First two principal component scores for the cases 
(green) and controls (red) of the simulation underlying Fig- 
ure 5. Triangles indicate individuals from Island 1, and circles 
the admixed individuals with ancestry from Islands 2 and 3. 

Tightly linked SNPs tend to be in high LD with 
each other, and sometimes one or more of the leading 
PCs will be dominated by large LD blocks. Since 
such blocks are genomically local, they convey little 
if any information about population structure. One 
way to avoid this problem is to filter GWAS SNPs 
prior to extracting the PCs, to exclude one in each 
pair of high LD SNPs. 

As for structured association and regression con- 
trol, the idea motivating PC adjustment is that if 
a correlation between the phenotype and the tested 
SNP can be partly explained by measures of an- 
cestry, here PCs, then including these as regression 
covariates prevents that part of the signal from con- 
tributing to the test statistic. In particular, in the 
case of linear regression, only the components of the 
phenotype and genotype vectors that are orthogo- 
nal to the PCs included in the model contribute to 
the test statistic. This should protect against spu- 
rious associations, provided that sufficient PCs are 
included in the model to explain potentially con- 
founding structure. For example, in the population 
of Figure 6, v\ and v<i can, with accuracy, jointly 
predict the proportion of an individual's ancestry 
arising from each of the three subpopulations, and 
because of the varying case-control ratios across the 
subpopulations, they can also to some extent predict 



POPULATION STRUCTURE 



15 



case-control status. Including the PCs as covariates 
in the regression model discards information, for ex- 
ample, indicating that alleles common in the high- 
risk subpopulation are more likely to be causal. Be- 
cause of the danger of ascertainment bias in retro- 
spective studies, such information may be dangerous 
and it is safer to discard it, but if ascertainment is 
not a problem, for example, in a prospective study, 
discarding this information is inefficient. 

For computational reasons the EIGENSTRAT soft- 
ware (Price et al., 2006), which implements PC ad- 
justment, does not include PCs as logistic regres- 
sion covariates, but instead uses a linear adjust- 
ment of both phenotypes and genotypes. Such an 
adjustment is valid only under the assumption that 
the yi form a homoskedastic sample (Agresti, 2002, 
page 120) and should be reasonable if the sample 
casexontrol ratio is not too far from 1, and the ef- 
fect sizes are small. 

By default, the EIGENSTRAT software includes 
the first ten PCs. Patterson, Price and Reich (2006) 
proposed a test to determine whether the lead eigen- 
value of K is significantly larger than one would 
expect under a null model, but it remains unclear 
what significance threshold for this test might be 
appropriate, if any, for protecting association test 
statistics from inflation. As for any other regression 
covariate, there is an argument for only including 
a PC in the model if it shows an association with 
the phenotype (Novembre and Stephens, 2008; Lee, 
Wright and Zou, 2010). Experience seems to suggest 
that between 2 and 15 PCs are typically sufficient, 
and in large studies for which n may be several thou- 
sand, these will correspond to a small loss of total 
genotypic information. 

While the intuition motivating PC-adjustment is 
valid under an island model, protection from popu- 
lation structure effects is not guaranteed under more 
complicated and realistic models of population struc- 
ture. In particular, inflation of test statistics due to 
cryptic relatedness is not ameliorated by PC adjust- 
ment. Moreover, if leading PCs reflect genome- local 
effects, PC adjustment could lose valuable informa- 
tion and lead to true effects being missed. 

3.7 Mixed Regression Models 

We assume here a quantitative phenotype with 
g the identity link. The linear mixed model (MM) 
extends (3.1) by including for each individual i a 
latent variable 5i such that 

(3.10) E[y i \5 i ]=a + x i p + S i . 



The value of 5i is interpreted as a polygenic contri- 
bution to the phenotype, due to many small, addi- 
tive, genetic effects distributed across the genome. 
In animal breeding genetics, the equivalent term is 
referred to as the breeding value. The additive as- 
sumption seems to be well supported for traits with 
a complex genetic basis (Hill, Goddard and Visscher, 
2008), although it is also possible to include latent 
variables corresponding to dominance effects. Un- 
der the additive polygenic assumption, the variance- 
covariance structure of 5 is proportional to the corre- 
lation structure of genotypes coded as allele counts, 
which from (2.1) is proportional to K, the kinship 
matrix. Hence, we assume 

(3.11) 5~ N(0,2a 2 h 2 K), 

where h 2 G [0, 1] is the narrow sense heritability, and 
is defined as the proportion of phenotypic variation 
that can be attributed to additive polygenic effects. 
The residuals are assumed to satisfy 

yi-a-XiP-6i~ N(0,a 2 (l - h 2 )I). 

The origins of this linear MM lie in the partition- 
ing by Fisher (1918) of the variance of a quanti- 
tative trait into independent genetic and environ- 
mental components, and derivation of the genetic 
correlation of trait values of a pair of relatives as- 
suming Mendelian inheritance (Gianola, 2007). It is 
conventional to introduce the 2 in (3.11) because 2K 
reduces to I in the limiting case of completely unre- 
lated and completely outbred individuals, in which 
case becomes inestimable. 

The model (3.10) has long been used for mapping 
quantitative trait loci in outbred pedigrees, using 
a pedigree-based kinship matrix (Hoschele, 2007). 
Yu et al. (2006), who were interested in association 
mapping in maize, were first to suggest using the 
same model to correct association analysis for popu- 
lation structure, but with K estimated from marker 
data. In fact, Yu et al. also include in their model ad- 
ditional population structure terms, namely, the an- 
cestral proportions estimated by the STRUCTURE 
software (Section 3.4). Zhao et al. (2007) used the 
same model to analyze an Arabidopsis thaliana data 
set and, like Yu et al., found that the additional 
terms improved the fit of the model. However, nei- 
ther set of authors formally assess the improvements 
in fit which, by visual inspection of the genome- wide 
p-value distributions, seem modest. In principle, K 
already includes population structure information, 
making the additional terms redundant. However, 
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the structure terms provide a low-dimensional sum- 
mary of key features of K which are likely to be 
better estimated than individual kinships, and this 
may generate some advantage to including the struc- 
ture terms in the regression model as well as K. 
Typically, ~10 5 SNPs are required for adequate es- 
timation of K in human populations, more than are 
required for estimation of PCs, but this number is 
usually available from a GWAS. 

Kang et al. (2008) have developed the software 
package EMMA for fast inference in linear MMs us- 
ing a likelihood ratio test. Alternatively, for very 
large samples one can use the score test which is 
computationally faster because it only requires pa- 
rameter estimates under the null hypothesis (/? = 0). 
Another fast method for inference in mixed models, 
GRAMMAR, has been proposed by Aulchenko, de 
Koning and Haley (2007). Although GRAMMAR is 
faster than EMMA, it is an approximate method 
and the authors found that it could be conservative 
and hence have reduced power. GRAMMAR uses 
the mixed model to predict the phenotype under 
the null hypothesis, 

y = & + 5, 

where 5 is the best linear unbiased predictor (BLUP) 
of 5, which is equivalent to the empirical Bayes esti- 
mate for 8 with prior (3.11) (Robinson, 1991). This 
prediction only needs to be made once for the whole 
data set. The next step is to use the residuals from 
the prediction as the outcome in a linear regression, 

(3.12) y-y = lii + xP + e, 

and test the parameter /3 for each SNP. An assump- 
tion underlying (3.12) is that the residuals are inde- 
pendent and identically distributed, which is strictly 
false. Indeed, both E(y — y) and Var(y — y) are func- 
tions of K unless h 2 = 0, which may be the reason 
that an additional GC-style variance inflation cor- 
rection is often required to calibrate the GRAM- 
MAR type I error rate. 
Note that (3.10) can be reparametrized as 

E[yi\i\ =a + Xif3 + va, 

with 

7~iV(0,2cj 2 /i 2 A), 

where A and v% are defined above at (3.9). Thus, the 
MM approach uses the same latent variables as PC 
adjustment but deals differently with the vector 7 
of nuisance parameters. From a Bayesian point of 



view, both methods put independent priors on the 
components of 7. Using k PCs as regression covari- 
ates can be viewed as assigning to each of the n — k 
trailing components of 7 a prior with unit mass at 
zero, while each of the k leading components receives 
a diffuse prior. These assignments imply certainty 
that the polygenic component of the phenotype is 
fully captured by the first k PCs. In contrast, the 
MM approach puts a Gaussian prior on each com- 
ponent of 7, with variances proportional to the cor- 
responding eigenvalues. 

4. SIMULATIONS 

4.1 Case-Control Studies Without 
Ascertainment Bias 

We show here the results of a small simulation 
study designed to illustrate the merits of some of 
the methods introduced above for correcting GWAS 
analysis for population structure. Although we have 
criticized the island model as unrealistic, it remains 
the most convenient starting point and we use it be- 
low, the only complication considered here being as- 
certainment bias. More extensive and realistic sim- 
ulations will be published elsewhere. 

We simulated data from 500 case-control stud- 
ies, each with 1000 cases and 1000 controls drawn 
from a population of 6000 individuals partitioned 
into three equal-size subpopulations. Ancestral mi- 
nor allele fractions were Uniform [0.05, 0.5] for all 
10,000 unlinked SNPs. For each SNP, we drew sub- 
population allele fractions from the beta-binomial 
model described in Balding and Nichols (1995). Un- 
der this model, a marker with ancestral population 
allele fraction p has subpopulation allele fractions 
that are independent draws from 

fl-F 1-F. A 
Betal — f^P,— pT~(l -P) J, 

where F is Wright's F$t, a measure of population 
divergence (Balding, 2003). In order to discriminate 
among the methods, we simulated a high level of 
population structure, F = 0.1, which is close to 
between-continent levels of human differentiation; 
this is larger than is typical for a well-designed GWAS, 
but some meta-analyses may include populations at 
this level of differentiation. The studies simulated 
here are relatively small by the standards of current 
GWAS, and larger studies will be affected by less 
pronounced structure than that simulated here. 
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Fig. 7. Expected versus observed — log 10 (p -values) for four 
test statics, GC (black), PC10 (blue), MM (red) and MCP 
(green) evaluated at 5 million SNPs (10,000 per dataset) sim- 
ulated under the null of no association in population case-con- 
trol studies (see text for details of the simulation) . 

We simulated the disease phenotype under a logis- 
tic regression model, with 20 SNP markers each as- 
signed allelic odds ratio 1.18. The population disease 
prevalence was 0.18. We performed tests for disease- 
phenotype association for all the markers in each 
data set, using median-adjusted GC, principal com- 
ponent adjustment with 10 PCs (PC10) and the like- 
lihood ratio test from the linear mixed model (MM). 
Note that the PC 10 and MM approaches apply a lin- 
ear regression model to binary outcome data, which 
(as noted in Section 3.6) should be reasonable if the 
casexontrol ratio is not extreme and the effect sizes 
are small. We also performed the score test described 
in Section 3.3. This retrospective model is consis- 
tent with the case-control ascertainment, although 
in fact the resulting score statistic is symmetric in 
x and y. We call this test MCP to stand for Multi- 
variate Gaussian model Conditional on Phenotype. 

The PC10, MM and MCP approaches all require 
specification of K, the kinship matrix of the study 
subjects. Figure 7 shows the observed and expected 
p- values at the null markers, aggregated over the 500 
simulations, using K estimated via genetic correla- 
tions (2.2). We see that the type I error is well cali- 
brated for all the methods except GC, which has a 
conservative null distribution. We repeated the anal- 
ysis using the true K used for the simulation and 
the results are similar (not shown). GC is conserva- 
tive because in this extreme scenario with large F$t 



and a small number of subpopulations, the assump- 
tion that the test statistic asymptotically follows a 
linear-inflated Xi distribution fails. The condition 
(3.7) is not satisfied in the beta-binomial model with 
fixed Fst unless the number of subpopulations in- 
creases with n. 

In order to compare power across the methods, 
we plotted ROC curves for the statistics from the 
four methods using both the true K and the es- 
timated K (Figure 8). GC has lower power than 
the other three methods, even though the ROC cal- 
ibrates its bad false positive rate. When K is used, 
the MCP and MM approaches are equally power- 
ful, and more powerful than the other two methods, 
because both exploit the between-subpopulation in- 
formation. When K is used both the MM and MCP 
methods lose their power advantage over PC10, which 
may be due to the sampling error in the eigenvectors 
of K. PC correction uses only the leading eigenvec- 
tors to adjust the analysis; these are less affected by 
noise and for an island model they contain all the 
population structure signal. For an actual GWAS, 
the MM and MCP methods should show perfor- 
mance somewhere intermediate between the cases 
considered here, because many more than 10,000 
SNPs are available to estimate K. 

4.2 Case-Control Studies With Ascertainment 
Bias 

We repeated the simulation studies described above 
but with ascertainment bias. Specifically, each sim- 
ulated study sampled 50 controls from two of the 
three subpopulations and 900 controls from the re- 
maining subpopulation. This corresponds to a sce- 
nario in which investigators are forced to search 
widely across subpopulations to obtain a sufficient 
number of cases for their study, but are able to re- 
cruit most controls from the local subpopulation. 
The casexontrol ratio varies dramatically over sub- 
populations in this scenario, so that subpopulation 
allele frequencies are strong predictors of case-control 
status. 

Once again, PC10, MM and MCP all have good 
control of type I error, while GC is dramatically 
conservative (not shown). We again plotted ROC 
curves for the statistics from the four methods us- 
ing both K and K (Figure 9). GC shows almost 
no power in these simulations. When K is used the 
MCP and MM approaches are equally powerful and 
more powerful than PC10. Their power advantage 
over PC10 is more substantial than in the previous 
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False positive rale False positive rate 

Fig. 8. ROC plots comparing true and false-positive rates of GC (black), PC10 (blue), MM (red) and MCP (green) estimated 
from data aggregated over 500 simulated retrospective population case-control studies. Left: true kinships, K ; right: estimated 
kinships, K. 



scenario without ascertainment bias, because here 
the leading eigenvectors of K are stronger predictors 
of case-control status. When K is used the MCP and 
MM methods again lose some power compared with 
PC10 and again the MCP method suffers a greater 
loss than the MM test. 

5. SUMMARY 

The theme of our review has been the unifying 
role of the matrix of kinship coefficients, K, and 
its estimate K defined at (2.2). We view popula- 
tion structure and cryptic kinship as the extremes 
of the same confounder, the latent pedigree, and K 
as a good summary of the pedigree for use in adjust- 
ing association analyses. We have also argued that, 
whereas methods are often tested under an island 
model of population structure, these models do not 
provide realistic descriptions of relatedness in hu- 
man populations. 

The median-adjusted Genomic Control (GC) is 
simple to apply and, for association studies with 
moderate sample sizes and small amounts of within 
sample relatedness, it is a satisfactory method for 
protecting against confounding, which requires rel- 
atively few SNPs (~10 2 ). When the study sample is 
drawn from a population with a few, distinct sub- 
populations, GC should be used with caution be- 
cause the x 2 approximation may fail. Our simula- 
tions also confirmed previous reports that GC is very 



sensitive to ascertainment bias. Structured associa- 
tion and regression adjustment may also be used 
with relatively few SNPs, and the latter has impor- 
tant advantages over GC. 

Linear mixed models (MM) and the multivariate 
Gaussian model conditioning on phenotype (MCP) 
explicitly model genetic correlations using K, and 
are respectively appropriate for prospective and ret- 
rospective studies with genome- wide SNP data. They 
are particularly suited to modeling complex patterns 
of kinship, including intermediate scenarios between 
close relationships and large scale structure, which 
can arise in plant genetics, human population iso- 
lates and in animal breeding studies. Previous com- 
putational limitations have largely been overcome 
in recent years. These models also provide powerful 
and computationally efficient methods for analyz- 
ing family data of any structure, and combinations 
of families and apparently unrelated individuals. 

Principal components (PC) adjustment in effect 
eliminates from test statistics the part of the phe- 
notypic signal that can be predicted from large scale 
population structure. In particular, if natural selec- 
tion leads to large allele frequency differences across 
subpopulations at a particular SNP, and case-control 
ratios vary across subpopulations, then spurious as- 
sociations can arise that PC adjustment will con- 
trol, because the SNP genotypes are strongly cor- 
related with subpopulation, whereas the MM and 
MCP methods will not. On the other hand, the MM 
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Fig. 9. ROC plots comparing true and false-positive rates of GC (black), PC10 (blue), MM (red) and MCP (green) estimated 
from data aggregated over 500 simulated population case-control studies with biased ascertainment of controls. Left: true 
kinships, K ; right: estimated kinships, K . 



and MCP methods can gain power over PC adjust- 
ment because they explicitly model the phenotype- 
genotype correlations induced by relatedness and ge- 
netic drift. For example, they should provide bet- 
ter power than PC adjustment when analyzing data 
from human population isolates which are homoge- 
neous for environmental exposures. 

There are probably better ways to extract the sig- 
nals of population structure from an estimate of K 
than those considered here. Selection of the first few 
principal components of K can be viewed form 
of signal denoising, but PC regression adjustment 
does not in general optimize power. It may be pos- 
sible to adapt the MM approach to maintain the 
power advantage over PC while reducing noise, for 
example, by smoothing or truncating the lower or- 
der eigenvalues of K. 
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